Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prototype for storage API #7

Open
wants to merge 16 commits into
base: main
Choose a base branch
from
Open

Prototype for storage API #7

wants to merge 16 commits into from

Conversation

jyoune
Copy link

@jyoune jyoune commented Jul 10, 2024

Basic skeleton for a prototype storage api for MMIF files that creates nested subdirectories based on the views and corresponding metadata present.

Copy link
Member

@keighrim keighrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fastapi is a uncharted territory for me as well. Let's keep researching on it.

@keighrim
Copy link
Member

For reference, this PR addresses the "storage" side of clamsproject/aapb-evaluations#50 .

Copy link
Member

@keighrim keighrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally, looks good, I left some additional feature request/suggestion in the comments.

from typing import List, Dict
from typing_extensions import Annotated
import os
import yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this yaml is used only for a limited purpose, namely to load config file. However, the existing code is based on .env file and dotenv module, so let's migrate to environment variable-based configuration to match the existing code.

return {"message": "Storage api for pipelined mmif files"}


@app.route("/upload_mmif/", methods=["POST"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's think about the routings here. Existing "baapb resolver" codebase (api/__init__.py) uses many routings, but only one of them (/searchapi/) is a pure API landing point (others are attached to some web page with minimal front end implementation).

Eventually, we want to run a single server app that can handle "assets" (video files) as well as can handle "mmif", meaning this storage_api.py should be merged into the api/ pacakge. To that end, I think we need some basic rules for naming these routing paths.

Since /searchapi is already used in production server (for baapb:// resolution), how about

  • /upload_mmif >> /storeapi/mmif
  • /retrieve >> /searchapi/mmif
    ?

Note that changes in the routing names will also impact the client-side implementation (that you're currently working on).

# and dump the param dicts
os.makedirs(directory, exist_ok=True)
for path in param_path_dict:
file_path = os.path.join(path, 'parameters.json')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this out, and figured that it would be more convenient for clients if we just store .mmif files only in this directory, and have an additional api to return the entire absolute path of the directory (just like how baapb:// is resolved), so that when we run some code that uses pre-stored mmif files on the same machine where those files are, the client doesn't have to "download" the mmif files, but directly can access the directory (if exists) and use the files under.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if we store the parameter files in the "appversion" directory, and keeping only mmif files in the "hash" directory, clients can just take * glob to read all mmif files, without worrying about any additional metadata-like files.
So instead of

storage-root/
├── app1
│   ├── v1
│   │   ├── hash1
│   │   │   ├── guid1.mmif
│   │   │   └── params.json
│   │   ├── hash2
│   │   │   ├── guid1.mmif
│   │   │   └── params.json
│   │   └── hash3
│   │       ├── guid1.mmif
│   │       └── params.json
│   └── v2
│       └── hash1
│           ├── guid1.mmif
│           └── params.json
└── app2
    ├── v1
    │   └── hash1
    │       ├── guid1.mmif
    │       └── params.json
    └── v2
        └── hash1
            ├── guid1.mmif
            └── params.json

,

storage-root/
├── app1
│   ├── v1
│   │   ├── hash1
│   │   │   └── guid1.mmif
│   │   ├── hash1.json
│   │   ├── hash2
│   │   │   └── guid1.mmif
│   │   ├── hash2.json
│   │   ├── hash3
│   │   │   └── guid1.mmif
│   │   └── hash3.json
│   └── v2
│       ├── hash1
│       │   └── guid1.mmif
│       └── hash1.json
└── app2
    ├── v1
    │   ├── hash1
    │   │   └── guid1.mmif
    │   └── hash1.json
    └── v2
        ├── hash1
        │   └── guid1.mmif
        └── hash1.json

pipeline = pipeline_from_param_json(data)
# get number of views for rewind if necessary
num_views = len(data['pipeline'])
guid = data.get('guid')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mentioned you want to work on (or already on it?) multi-guid query scenario. As mentioned in other comment, how about also adding zero-guid query, to return just the full directory path?

jyoune added 5 commits August 7, 2024 15:43
…uration, changed route names, added "zero-guid" functionality to return just the absolute path for the pipeline, and changed directory level for storing parameter hashes (from hash dir to version number dir)
@keighrim
Copy link
Member

A few additional suggestions after using it for uploading in recent days.

  1. zero-guid scenario and "rewind" feature: because of the rewind feature, I planned a regular "garbage collection" process to clean mmif files from non-terminal directories. However this needs to be more thought through since when the garbage collection is in place, the zero-guid query can return an empty directory.
  2. overwrite: what should happen when a payload for an upload request conflicts with an existing file?
    1. Can we just blindly overwrite?
    2. Can we just blindly reject the upload?
    3. Should we conduct some sort of "deep-diff" between two MMIF files and decide?
  3. document locations: we need to decide whether we want to allow users (uploaders) to use file:// scheme for document location (which is not persistent and possibly only available on the user's personal device), or only allow baapb:// locations for consistency and reproducibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

2 participants